AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.
The objectives are to predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segment of customers to target more.
Data Dictionary:
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIPCode: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level (1: Undergrad; 2: Graduate; 3: Advanced/Professional)
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any bank other than AllLife Bank? (0: No, 1: Yes)

# Installing the libraries with the specified versions.
# !pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# import libraries for data manipulation
import numpy as np
import pandas as pd
# import libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
# Library to split data
from sklearn.model_selection import train_test_split
# To build model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)
# Mount the drive for Google Colab
from google.colab import drive
drive.mount('/content/drive/')
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
# Read the Loan Modelling csv file
df = pd.read_csv('/content/drive/MyDrive/AIML_LoanCampaign/Loan_Modelling.csv')
# Returns first 5 rows of the dataframe
df.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
# Returns last 5 rows of the dataframe
df.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
df.shape #returns dimension of the dataframe
(5000, 14)
df.info() # returns the summary of dataframe including the index dtype and columns, non-null values and memory usage.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):

| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | ID | 5000 non-null | int64 |
| 1 | Age | 5000 non-null | int64 |
| 2 | Experience | 5000 non-null | int64 |
| 3 | Income | 5000 non-null | int64 |
| 4 | ZIPCode | 5000 non-null | int64 |
| 5 | Family | 5000 non-null | int64 |
| 6 | CCAvg | 5000 non-null | float64 |
| 7 | Education | 5000 non-null | int64 |
| 8 | Mortgage | 5000 non-null | int64 |
| 9 | Personal_Loan | 5000 non-null | int64 |
| 10 | Securities_Account | 5000 non-null | int64 |
| 11 | CD_Account | 5000 non-null | int64 |
| 12 | Online | 5000 non-null | int64 |
| 13 | CreditCard | 5000 non-null | int64 |

dtypes: float64(1), int64(13)
memory usage: 547.0 KB
df.isna().values.any() # Checks if there is any null value in any column
False
df.describe().T # returns stats for all numerical columns
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
# Let's check for duplicate values; if any exist, we will remove them
df[df.duplicated()].count() # check for any duplicate values
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Personal_Loan         0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
dtype: int64
df.nunique() # returns unique values for each column
ID                    5000
Age                     45
Experience              47
Income                 162
ZIPCode                467
Family                   4
CCAvg                  108
Education                3
Mortgage               347
Personal_Loan            2
Securities_Account       2
CD_Account               2
Online                   2
CreditCard               2
dtype: int64
There are 5000 rows and 14 columns in the Loan Campaign dataframe.
All columns in this loan dataset are numeric, and no column has missing (null) values.
13 columns are integers and only CCAvg is a float.
The 14 columns comprise 13 features and 1 target (Personal_Loan).
There are no null values and no duplicate rows.
The minimum value for Experience is negative (-3), which is not possible; this needs treatment.
Although ZIPCode is a numeric column, it is really categorical here and may need treatment.
Total memory used = 547.0 KB.
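Since ZIPCode is really categorical, one possible treatment (a sketch on a synthetic sample, not the project's prescribed approach) is to bucket codes by their first two digits into a coarser region feature:

```python
import pandas as pd

# Hypothetical treatment: reduce the 467 raw ZIP codes to coarse regions by
# keeping the first two digits, then mark the result as categorical.
sample = pd.DataFrame({"ZIPCode": [91107, 90089, 94720, 94112]})
sample["ZIP_Region"] = sample["ZIPCode"].astype(str).str[:2].astype("category")
print(sample["ZIP_Region"].tolist())  # ['91', '90', '94', '94']
```

This keeps geographic signal while avoiding a 467-level categorical (or a meaningless numeric) feature.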
df["Age"].unique() # returns unique Age of customers
array([25, 45, 39, 35, 37, 53, 50, 34, 65, 29, 48, 59, 67, 60, 38, 42, 46,
55, 56, 57, 44, 36, 43, 40, 30, 31, 51, 32, 61, 41, 28, 49, 47, 62,
58, 54, 33, 27, 66, 24, 52, 26, 64, 63, 23])
df["Experience"].unique() # returns unique Experience for customers
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43])
df["Income"].unique() # returns unique Income of customers
array([ 49, 34, 11, 100, 45, 29, 72, 22, 81, 180, 105, 114, 40,
112, 130, 193, 21, 25, 63, 62, 43, 152, 83, 158, 48, 119,
35, 41, 18, 50, 121, 71, 141, 80, 84, 60, 132, 104, 52,
194, 8, 131, 190, 44, 139, 93, 188, 39, 125, 32, 20, 115,
69, 85, 135, 12, 133, 19, 82, 109, 42, 78, 51, 113, 118,
64, 161, 94, 15, 74, 30, 38, 9, 92, 61, 73, 70, 149,
98, 128, 31, 58, 54, 124, 163, 24, 79, 134, 23, 13, 138,
171, 168, 65, 10, 148, 159, 169, 144, 165, 59, 68, 91, 172,
55, 155, 53, 89, 28, 75, 170, 120, 99, 111, 33, 129, 122,
150, 195, 110, 101, 191, 140, 153, 173, 174, 90, 179, 145, 200,
183, 182, 88, 160, 205, 164, 14, 175, 103, 108, 185, 204, 154,
102, 192, 202, 162, 142, 95, 184, 181, 143, 123, 178, 198, 201,
203, 189, 151, 199, 224, 218])
df["ZIPCode"].unique() # returns unique Zipcode of customers
array([91107, 90089, 94720, 94112, 91330, 92121, 91711, 93943, 93023,
94710, 90277, 93106, 94920, 91741, 95054, 95010, 94305, 91604,
94015, 90095, 91320, 95521, 95064, 90064, 94539, 94104, 94117,
94801, 94035, 92647, 95814, 94114, 94115, 92672, 94122, 90019,
95616, 94065, 95014, 91380, 95747, 92373, 92093, 94005, 90245,
95819, 94022, 90404, 93407, 94523, 90024, 91360, 95670, 95123,
90045, 91335, 93907, 92007, 94606, 94611, 94901, 92220, 93305,
95134, 94612, 92507, 91730, 94501, 94303, 94105, 94550, 92612,
95617, 92374, 94080, 94608, 93555, 93311, 94704, 92717, 92037,
95136, 94542, 94143, 91775, 92703, 92354, 92024, 92831, 92833,
94304, 90057, 92130, 91301, 92096, 92646, 92182, 92131, 93720,
90840, 95035, 93010, 94928, 95831, 91770, 90007, 94102, 91423,
93955, 94107, 92834, 93117, 94551, 94596, 94025, 94545, 95053,
90036, 91125, 95120, 94706, 95827, 90503, 90250, 95817, 95503,
93111, 94132, 95818, 91942, 90401, 93524, 95133, 92173, 94043,
92521, 92122, 93118, 92697, 94577, 91345, 94123, 92152, 91355,
94609, 94306, 96150, 94110, 94707, 91326, 90291, 92807, 95051,
94085, 92677, 92614, 92626, 94583, 92103, 92691, 92407, 90504,
94002, 95039, 94063, 94923, 95023, 90058, 92126, 94118, 90029,
92806, 94806, 92110, 94536, 90623, 92069, 92843, 92120, 95605,
90740, 91207, 95929, 93437, 90630, 90034, 90266, 95630, 93657,
92038, 91304, 92606, 92192, 90745, 95060, 94301, 92692, 92101,
94610, 90254, 94590, 92028, 92054, 92029, 93105, 91941, 92346,
94402, 94618, 94904, 93077, 95482, 91709, 91311, 94509, 92866,
91745, 94111, 94309, 90073, 92333, 90505, 94998, 94086, 94709,
95825, 90509, 93108, 94588, 91706, 92109, 92068, 95841, 92123,
91342, 90232, 92634, 91006, 91768, 90028, 92008, 95112, 92154,
92115, 92177, 90640, 94607, 92780, 90009, 92518, 91007, 93014,
94024, 90027, 95207, 90717, 94534, 94010, 91614, 94234, 90210,
95020, 92870, 92124, 90049, 94521, 95678, 95045, 92653, 92821,
90025, 92835, 91910, 94701, 91129, 90071, 96651, 94960, 91902,
90033, 95621, 90037, 90005, 93940, 91109, 93009, 93561, 95126,
94109, 93107, 94591, 92251, 92648, 92709, 91754, 92009, 96064,
91103, 91030, 90066, 95403, 91016, 95348, 91950, 95822, 94538,
92056, 93063, 91040, 92661, 94061, 95758, 96091, 94066, 94939,
95138, 95762, 92064, 94708, 92106, 92116, 91302, 90048, 90405,
92325, 91116, 92868, 90638, 90747, 93611, 95833, 91605, 92675,
90650, 95820, 90018, 93711, 95973, 92886, 95812, 91203, 91105,
95008, 90016, 90035, 92129, 90720, 94949, 90041, 95003, 95192,
91101, 94126, 90230, 93101, 91365, 91367, 91763, 92660, 92104,
91361, 90011, 90032, 95354, 94546, 92673, 95741, 95351, 92399,
90274, 94087, 90044, 94131, 94124, 95032, 90212, 93109, 94019,
95828, 90086, 94555, 93033, 93022, 91343, 91911, 94803, 94553,
95211, 90304, 92084, 90601, 92704, 92350, 94705, 93401, 90502,
94571, 95070, 92735, 95037, 95135, 94028, 96003, 91024, 90065,
95405, 95370, 93727, 92867, 95821, 94566, 95125, 94526, 94604,
96008, 93065, 96001, 95006, 90639, 92630, 95307, 91801, 94302,
91710, 93950, 90059, 94108, 94558, 93933, 92161, 94507, 94575,
95449, 93403, 93460, 95005, 93302, 94040, 91401, 95816, 92624,
95131, 94965, 91784, 91765, 90280, 95422, 95518, 95193, 92694,
90275, 90272, 91791, 92705, 91773, 93003, 90755, 96145, 94703,
96094, 95842, 94116, 90068, 94970, 90813, 94404, 94598])
len(df["ZIPCode"].unique()) # returns total length for Zipcodes (unique)
467
df["Family"].unique() # returns unique Family of customers
array([4, 3, 1, 2])
df["CCAvg"].unique() # returns unique Credit Card Average spend for all customers
array([ 1.6 , 1.5 , 1. , 2.7 , 0.4 , 0.3 , 0.6 , 8.9 , 2.4 ,
0.1 , 3.8 , 2.5 , 2. , 4.7 , 8.1 , 0.5 , 0.9 , 1.2 ,
0.7 , 3.9 , 0.2 , 2.2 , 3.3 , 1.8 , 2.9 , 1.4 , 5. ,
2.3 , 1.1 , 5.7 , 4.5 , 2.1 , 8. , 1.7 , 0. , 2.8 ,
3.5 , 4. , 2.6 , 1.3 , 5.6 , 5.2 , 3. , 4.6 , 3.6 ,
7.2 , 1.75, 7.4 , 2.67, 7.5 , 6.5 , 7.8 , 7.9 , 4.1 ,
1.9 , 4.3 , 6.8 , 5.1 , 3.1 , 0.8 , 3.7 , 6.2 , 0.75,
2.33, 4.9 , 0.67, 3.2 , 5.5 , 6.9 , 4.33, 7.3 , 4.2 ,
4.4 , 6.1 , 6.33, 6.6 , 5.3 , 3.4 , 7. , 6.3 , 8.3 ,
6. , 1.67, 8.6 , 7.6 , 6.4 , 10. , 5.9 , 5.4 , 8.8 ,
1.33, 9. , 6.7 , 4.25, 6.67, 5.8 , 4.8 , 3.25, 5.67,
8.5 , 4.75, 4.67, 3.67, 8.2 , 3.33, 5.33, 9.3 , 2.75])
df["Education"].unique() # returns unique Education of customers
array([1, 2, 3])
df["Mortgage"].unique() # returns unique Mortgage of customers
array([ 0, 155, 104, 134, 111, 260, 163, 159, 97, 122, 193, 198, 285,
412, 153, 211, 207, 240, 455, 112, 336, 132, 118, 174, 126, 236,
166, 136, 309, 103, 366, 101, 251, 276, 161, 149, 188, 116, 135,
244, 164, 81, 315, 140, 95, 89, 90, 105, 100, 282, 209, 249,
91, 98, 145, 150, 169, 280, 99, 78, 264, 113, 117, 325, 121,
138, 77, 158, 109, 131, 391, 88, 129, 196, 617, 123, 167, 190,
248, 82, 402, 360, 392, 185, 419, 270, 148, 466, 175, 147, 220,
133, 182, 290, 125, 124, 224, 141, 119, 139, 115, 458, 172, 156,
547, 470, 304, 221, 108, 179, 271, 378, 176, 76, 314, 87, 203,
180, 230, 137, 152, 485, 300, 272, 144, 94, 208, 275, 83, 218,
327, 322, 205, 227, 239, 85, 160, 364, 449, 75, 107, 92, 187,
355, 106, 587, 214, 307, 263, 310, 127, 252, 170, 265, 177, 305,
372, 79, 301, 232, 289, 212, 250, 84, 130, 303, 256, 259, 204,
524, 157, 231, 287, 247, 333, 229, 357, 361, 294, 86, 329, 142,
184, 442, 233, 215, 394, 475, 197, 228, 297, 128, 241, 437, 178,
428, 162, 234, 257, 219, 337, 382, 397, 181, 120, 380, 200, 433,
222, 483, 154, 171, 146, 110, 201, 277, 268, 237, 102, 93, 354,
195, 194, 238, 226, 318, 342, 266, 114, 245, 341, 421, 359, 565,
319, 151, 267, 601, 567, 352, 284, 199, 80, 334, 389, 186, 246,
589, 242, 143, 323, 535, 293, 398, 343, 255, 311, 446, 223, 262,
422, 192, 217, 168, 299, 505, 400, 165, 183, 326, 298, 569, 374,
216, 191, 408, 406, 452, 432, 312, 477, 396, 582, 358, 213, 467,
331, 295, 235, 635, 385, 328, 522, 496, 415, 461, 344, 206, 368,
321, 296, 373, 292, 383, 427, 189, 202, 96, 429, 431, 286, 508,
210, 416, 553, 403, 225, 500, 313, 410, 273, 381, 330, 345, 253,
258, 351, 353, 308, 278, 464, 509, 243, 173, 481, 281, 306, 577,
302, 405, 571, 581, 550, 283, 612, 590, 541])
# Returns unique negative Experience values
df[df["Experience"] < 0]["Experience"].unique()
array([-1, -2, -3])
# Returns the count of each negative Experience value
df[df["Experience"] < 0]["Experience"].value_counts()
Experience
-1    33
-2    15
-3     4
Name: count, dtype: int64
# Correcting the negative Experience values by taking their absolute value,
# assuming they are data-entry errors
df["Experience"] = df["Experience"].abs()
df.describe().T # Returns Statistical Summary after Experience column is fixed.
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.134600 | 11.415189 | 0.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
Customers range from 23 to 67 years old.
Maximum Experience is 43 years, with both mean and median around 20 years.
Minimum Income is 8K and maximum is 224K; the mean is 73K and the median is 64K, so we may see some outliers for Income.
There are 52 rows in total where Experience is less than 0.
Maximum Mortgage is 635K whereas the median is 0, which indicates outliers.
Customers spend 0K to 10K per month on credit cards, with a mean a little below 2K (1.93) and a median of 1.5K.
There are 467 unique ZIP codes in total.
Outliers are expected for Income, Mortgage and CCAvg (they may or may not need treatment).
Negative Experience values occur mostly in the 23-29 age group.
All negative Experience values look like data-entry errors, so they have been replaced by their absolute values.
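The sign fix above can be done in one vectorized step with `Series.abs()`; a minimal sketch on a synthetic column:

```python
import pandas as pd

# Synthetic Experience column with the same kind of data-entry errors
exp = pd.Series([-1, -2, -3, 10, 20])
exp = exp.abs()  # equivalent to replacing -1, -2, -3 with 1, 2, 3
print(exp.tolist())  # [1, 2, 3, 10, 20]
```

A quick `(df["Experience"] >= 0).all()` check afterwards confirms no negatives remain.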
# Returns percentage of customers who have a CD account
round((df[df['CD_Account'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
6.04
# Returns percentage of customers who have a credit card from another bank
round((df[df['CreditCard'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
29.4
# Returns percentage of customers who use internet banking facilities
round((df[df['Online'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
59.68
# Returns percentage of customers who have a securities account with the bank
round((df[df['Securities_Account'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
10.44
# Returns percentage of customers who accepted the personal loan
round((df[df['Personal_Loan'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
9.6
# Note: this checks Mortgage == 1 (a mortgage of exactly 1 thousand dollars),
# which no customer has; use df['Mortgage'] > 0 to get the share of mortgage holders
round((df[df['Mortgage'] == 1]['ID'].count()/df['ID'].nunique()) *100,2)
0.0
# Returns count of customers by family size
df.groupby('Family')['ID'].count()
Family
1    1472
2    1296
3    1010
4    1222
Name: ID, dtype: int64
# Returns count of customers by education level
df.groupby('Education')['ID'].count()
Education
1    2096
2    1403
3    1501
Name: ID, dtype: int64
# Returns percent of customers who spend less than 5K per month on credit cards
round((df[df['CCAvg'] < 5]['ID'].count()/df['ID'].nunique()) *100,2)
92.72
# Returns percent of customers who spend less than 2K per month on credit cards
round((df[df['CCAvg'] < 2]['ID'].count()/df['ID'].nunique()) *100,2)
61.18
Only 6.04 percent of customers have a CD account.
29.4 percent of customers use credit cards issued by other banks.
Approx 60 percent of customers use online banking facilities.
Only 9.6 percent of customers took a loan after the campaign.
Only 10.44 percent of customers have a securities account.
More than 90% of customers spend less than 5K per month on credit cards.
More than 60% of customers spend less than 2K per month on credit cards.
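The individual percentages above can also be computed in one pass: the mean of a 0/1 indicator column is the share of 1s. A sketch on a small synthetic frame (not the real df):

```python
import pandas as pd

# Synthetic stand-in with two 0/1 indicator columns
sample = pd.DataFrame({
    "CD_Account": [0, 0, 0, 1],
    "Online": [1, 1, 0, 0],
})
# mean of a 0/1 column = fraction of customers with the flag set
shares = (sample[["CD_Account", "Online"]].mean() * 100).round(2)
print(shares.to_dict())  # {'CD_Account': 25.0, 'Online': 50.0}
```

On the real df, `df[["CD_Account", "Online", "Securities_Account", "CreditCard", "Personal_Loan"]].mean() * 100` would reproduce all the flag percentages at once.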
Questions:
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12, 7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    # create 2 stacked subplots sharing the x-axis: a slim boxplot on top
    # and the histogram below
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a star marks the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # mark the mean on the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # mark the median on the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, hue, perc=False, n=None):
    """
    Barplot with percentage/count labels on top

    data: dataframe
    feature: dataframe column
    hue: column used to colour the bars
    perc: whether to display percentages instead of counts (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        hue=hue,
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # x-coordinate of the bar centre
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the label above the bar

    plt.show()  # show the plot
# calls function to plot a boxplot and a histogram along the same scale for Age
histogram_boxplot(df, "Age", kde=True)
Age is well distributed in the dataset but shows 5 spikes.
The minimum age is 23 and the maximum is 67.
Both mean and median are about 45.
# calls function to plot a boxplot and a histogram along the same scale for Experience
histogram_boxplot(df, "Experience")
Experience is well distributed, with 4 spikes.
Minimum Experience is 0 years whereas maximum is 43 years.
Mean and median are both close to 20.
# calls function to plot a boxplot and a histogram along the same scale for Income
histogram_boxplot(df, "Income", kde = True)
# returns percent of customers which have income less than 100K
df[df['Income'] < 100]['ID'].count()/df['ID'].count()
0.7556
Income is heavily right-skewed; there are more customers with low income.
About 75 percent of customers have income below 100K.
Income ranges from 8K to 224K.
Maximum Income (224K) is much higher than Q3 (98K).
The mean is 73K whereas the median is 64K (median < mean).
We see outliers for Income.
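The outlier claim for Income can be made precise with the standard 1.5×IQR boxplot rule; a sketch on synthetic values (the real column would be passed in place of the toy series):

```python
import pandas as pd

# Toy income values; the last one is an obvious high outlier
income = pd.Series([10, 20, 30, 40, 500])
q1, q3 = income.quantile(0.25), income.quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)  # upper whisker of the boxplot
n_outliers = int((income > upper).sum())
print(upper, n_outliers)  # 70.0 1
```

Applying the same rule to `df["Income"]`, `df["CCAvg"]` and `df["Mortgage"]` would quantify the outliers seen in their boxplots.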
# calls function to plot a boxplot and a histogram along the same scale for Credit Card Spend Average
histogram_boxplot(df, "CCAvg")
# Returns percent of customers who spend more than 5K per month on credit cards
round((df[df['CCAvg'] > 5]['ID'].count()/df['ID'].nunique()) *100,2)
6.92
# Returns percent of customers who spend 2K or less per month on credit cards
round((df[df['CCAvg'] <=2 ]['ID'].count()/df['ID'].nunique()) *100,2)
64.94
CCAvg is heavily right-skewed; half of the customers spend 1.5K (the median) or less per month on credit cards.
CCAvg ranges from 0 to 10K.
Maximum CCAvg (10K) is much higher than Q3 (2.5K).
About 65 percent of customers spend 2K or less per month.
Less than 7 percent of customers spend more than 5K per month.
The mean is 1.93K whereas the median is 1.5K.
There are many outliers in CCAvg on the higher side.
# calls function to plot a boxplot and a histogram along the same scale for Mortgage
histogram_boxplot(df, "Mortgage")
Mortgage is heavily right-skewed.
Mortgage ranges from 0 to 635K.
Maximum Mortgage (635K) is much higher than Q3 (101K).
The median is 0.
There are many outliers in Mortgage on the higher side.
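Because the Mortgage median is 0, a useful companion summary is the share of actual mortgage holders; note the filter must be `> 0` (the earlier `== 1` check matches no one). A sketch on a synthetic column:

```python
import pandas as pd

# Synthetic Mortgage column: zeros dominate, as in the real data
mortgage = pd.Series([0, 0, 0, 155, 104])
pct_with_mortgage = round((mortgage > 0).mean() * 100, 2)
print(pct_with_mortgage)  # 40.0
```

On the real df, `round((df["Mortgage"] > 0).mean() * 100, 2)` gives the true percentage of mortgage holders.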
# categorical plot for education
labeled_barplot(df, "Education", "Personal_Loan", perc=True)
Approx 40% of customers are undergrads; graduates are slightly fewer than advanced professionals.
Personal loans are more common among graduates and advanced professionals.
# categorical plot for Family
labeled_barplot(df, "Family", "Personal_Loan", perc=True)
Customers with a family size of 3 or 4 have a slightly higher chance of accepting a personal loan.
# categorical plot for CreditCard
labeled_barplot(df, "CreditCard", "Personal_Loan", perc=True)
A little less than 64% of customers neither use a credit card from another bank nor have a personal loan.
# categorical plot for Online
labeled_barplot(df, "Online", "Personal_Loan", perc=True)
Approx 60% of customers use online banking; 53.9% of all customers use it and have not taken a personal loan.
# categorical plot for Personal_Loan
labeled_barplot(df, "Personal_Loan", "Personal_Loan", perc=True)
Very few customers accepted the personal loan in the campaign.
Only 9.6% of customers accepted the personal loan.
# categorical plot for Securities Account
labeled_barplot(df, "Securities_Account", "Personal_Loan", perc=True)
Most customers do not have a securities account; only a little over 500 (522 of 5000) hold one.
89.6% of customers do not have a securities account (81.2% of all customers have neither a securities account nor a personal loan).
# categorical plot for CD_account
labeled_barplot(df, "CD_Account", "Personal_Loan", perc=True)
Approx 94% of customers do not have a certificate of deposit (CD) account; 87.2% of all customers have neither a CD account nor a personal loan.
CD account holders have a higher chance of accepting a personal loan.
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 3, 3))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))  # place the legend outside the axes
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(8, 6))
    target_uniq = data[target].unique()

    # distribution of the predictor for each target class
    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0])

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
    )

    plt.tight_layout()
    plt.show()
# correlation among variables
plt.figure(figsize=(15, 7))
sns.heatmap(df.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral") #Returns the heatmap of the data
plt.show()
# Pair plot for continuous variables
sns.pairplot(data=df, vars=['Age', 'Income', 'Mortgage', 'CCAvg' ], hue='Personal_Loan');
Age and Experience are highly correlated, hence the Experience column can be dropped.
Income and CCAvg (credit card average spend) are positively correlated.
Customers with higher income are more likely to take a personal loan, as Income is positively correlated with the target.
Apart from Income and CCAvg, CD_Account is another factor to consider for the personal loan.
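Feature-target correlations like those read off the heatmap can also be ranked directly; a sketch on a synthetic frame standing in for df (ID would be dropped first):

```python
import pandas as pd

# Synthetic stand-in: two features and the binary target
sample = pd.DataFrame({
    "Income": [40, 60, 120, 150],
    "Age": [30, 50, 40, 60],
    "Personal_Loan": [0, 0, 1, 1],
})
# correlation of every numeric column with the target, strongest first
corr_with_target = (
    sample.corr(numeric_only=True)["Personal_Loan"]
    .drop("Personal_Loan")
    .sort_values(ascending=False)
)
print(corr_with_target.index.tolist())  # ['Income', 'Age']
```

On the real df, `df.drop(columns=["ID"]).corr()["Personal_Loan"].sort_values(ascending=False)` would rank all candidate features the same way.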
# calls stacked bar plot function for different categorical parameters wrt Personal loan
stacked_barplot(df, "Family", "Personal_Loan")
stacked_barplot(df, "Education", "Personal_Loan")
stacked_barplot(df, "Securities_Account", "Personal_Loan")
stacked_barplot(df, "CD_Account", "Personal_Loan")
stacked_barplot(df, "Online", "Personal_Loan")
stacked_barplot(df, "CreditCard", "Personal_Loan")
| Family | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 4 | 1088 | 134 | 1222 |
| 3 | 877 | 133 | 1010 |
| 1 | 1365 | 107 | 1472 |
| 2 | 1190 | 106 | 1296 |

| Education | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 3 | 1296 | 205 | 1501 |
| 2 | 1221 | 182 | 1403 |
| 1 | 2003 | 93 | 2096 |

| Securities_Account | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 0 | 4058 | 420 | 4478 |
| 1 | 462 | 60 | 522 |

| CD_Account | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 0 | 4358 | 340 | 4698 |
| 1 | 162 | 140 | 302 |

| Online | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 1 | 2693 | 291 | 2984 |
| 0 | 1827 | 189 | 2016 |

| CreditCard | Personal_Loan=0 | Personal_Loan=1 | All |
|---|---|---|---|
| All | 4520 | 480 | 5000 |
| 0 | 3193 | 337 | 3530 |
| 1 | 1327 | 143 | 1470 |
# calls distribution plot function for different parameters wrt Personal loan
distribution_plot_wrt_target(df, "Age", "Personal_Loan")
distribution_plot_wrt_target(df, "Experience", "Personal_Loan")
distribution_plot_wrt_target(df, "Income", "Personal_Loan")
distribution_plot_wrt_target(df, "ZIPCode", "Personal_Loan")
distribution_plot_wrt_target(df, "CCAvg", "Personal_Loan")
distribution_plot_wrt_target(df, "Mortgage", "Personal_Loan")
Customers with a family size of 3 or 4 have taken personal loans more often than those with a family size of 1 or 2.
Customers with an Advanced/Professional or Graduate education are a little more likely to take the loan than undergrads.
Customers with higher income are more likely to take a personal loan.
Personal-loan takers are concentrated among high-income customers (100K and above).
Age and Experience show no pattern with respect to the personal loan.
Customers with a higher mortgage are more likely to take a personal loan.
Customers with higher credit card spending (CCAvg) are more likely to take a personal loan.
About 46 percent of CD account holders (140 of 302) have taken a personal loan.
Approx 87% of customers (4358 of 5000) have neither a CD account nor a personal loan.
Use of online/internet banking facilities shows no impact on the personal loan.
Customers who took a personal loan are less likely to use credit cards from other banks.
sns.boxplot(data=df,y='Income',x='Education',hue='Personal_Loan');
As the education level increases, mean income also increases for customers who have personal loans.
sns.boxplot(data=df,y='Income',x='Family',hue='Personal_Loan');
Income level among all Family groups is significantly higher for customers who have a Personal Loan.
There are several outliers in Family size 1 and 2 for customers who don't have a Personal loan compared to the rest.
sns.boxplot(data=df,y='Income',x='CD_Account',hue='Personal_Loan');
Ignoring outliers, CD account holders tend to have higher income than non-holders.
Customers with CD accounts are more likely to take a personal loan.
sns.boxplot(data=df,y='Mortgage',x='Family',hue='Personal_Loan');
Ignoring outliers, customers are more likely to take a personal loan as family size increases.
sns.boxplot(data=df,y='Mortgage',x='Education',hue='Personal_Loan');
Ignoring outliers, customers at education level 1 (undergrads) have larger mortgages than graduates and advanced professionals.
sns.boxplot(data=df,y='Mortgage',x='CD_Account',hue='Personal_Loan');
Observations on Patterns
People having higher income have taken personal loans
People with 2-4 family members are more likely to take a personal loan
People with high mortgages opted for the loan
People with a higher credit card average (CCAvg) opted for a personal loan
More customers with an Advanced/Professional education level have borrowed personal loans than Graduates and Undergrads
More customers with a family size of 3 or more have borrowed personal loans than others
60 of those who had a personal loan with the bank also had a securities account.
Almost 50% of customers holding a certificate of deposit (CD) had borrowed a personal loan. However, 4358 out of 5000 customers have neither a CD account nor a personal loan, which suggests that a customer without a CD account is unlikely to take a personal loan.
The majority of customers who had a personal loan with the bank did not use a credit card from another bank.
Questions:
What is the distribution of the mortgage attribute?
Mortgage is heavily right-skewed. It ranges from 0 to 635K; the maximum (635K) is far above Q3 (101K), the median is 0K, and there are many outliers on the higher side.
Are there any noticeable patterns or outliers in the distribution?
Yes, we see outliers for Income, Mortgage and CCAvg. Income and CCAvg are highly positively correlated (0.67): customers spend more if their income is high.
How many customers have credit cards?
A total of 1470 customers have a credit card from another bank. Of these, 1327 do not have a personal loan and 143 do.
What are the attributes that have a strong correlation with the target attribute (personal loan)?
The strongest positive correlation with Personal_Loan is Income (close to 0.5), followed by CCAvg (0.37).
How does a customer's interest in purchasing a loan vary with their age?
Age has no correlation with purchasing a personal loan.
How does a customer's interest in purchasing a loan vary with their education? Customers with an Advanced/Professional or Graduate education are slightly more likely to take the loan than Undergrads.
Customers with education level 3 (Advanced/Professional) have an approx 15% acceptance rate for the personal loan, level 2 (Graduate) approx 14.9%, and level 1 (Undergrad) approx 4.64%.
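Per-level acceptance rates like these come from grouping on Education and averaging the 0/1 target. A minimal sketch on a hypothetical toy frame (invented values, not the real data):

```python
import pandas as pd

# Hypothetical toy frame showing how acceptance rate per education level
# can be computed; Personal_Loan is 0/1, so the group mean IS the rate.
toy = pd.DataFrame({
    "Education":     [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Personal_Loan": [0, 0, 0, 1, 0, 1, 1, 1, 1, 0],
})

acceptance = toy.groupby("Education")["Personal_Loan"].mean() * 100
print(acceptance)
```

The same one-liner on the actual dataframe would reproduce the percentages quoted above.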
dfcopy= df.copy() # create a copy of data in case we need to restore
dfcopy.info() # Test if copy is all good
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
# Dropping ID as it is a unique identifier with no bearing on Personal_Loan
# Dropping Experience as Age and Experience are highly correlated
dfcopy.drop(['Experience','ID'], axis=1,inplace=True)
# Dropping ZIPCode as it does not provide much insight
dfcopy.drop(['ZIPCode'], axis=1,inplace=True)
dfcopy.describe()
| Age | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 |
| mean | 45.338400 | 73.774200 | 2.396400 | 1.937938 | 1.881000 | 56.498800 | 0.096000 | 0.104400 | 0.06040 | 0.596800 | 0.294000 |
| std | 11.463166 | 46.033729 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 23.000000 | 8.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 25% | 35.000000 | 39.000000 | 1.000000 | 0.700000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 |
| 50% | 45.000000 | 64.000000 | 2.000000 | 1.500000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 0.000000 |
| 75% | 55.000000 | 98.000000 | 3.000000 | 2.500000 | 3.000000 | 101.000000 | 0.000000 | 0.000000 | 0.00000 | 1.000000 | 1.000000 |
| max | 67.000000 | 224.000000 | 4.000000 | 10.000000 | 3.000000 | 635.000000 | 1.000000 | 1.000000 | 1.00000 | 1.000000 | 1.000000 |
# outlier detection using boxplot
numeric_columns = dfcopy.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(dfcopy[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# function to treat outliers in a single column
def treat_outliers(data, column):
    """
    Treats outliers in a variable using IQR-based capping
    data: dataframe
    column: dataframe column
    """
    Q1 = data[column].quantile(0.25)  # 25th percentile
    Q3 = data[column].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    # values smaller than Lower_Whisker are capped at Lower_Whisker;
    # values greater than Upper_Whisker are capped at Upper_Whisker
    data[column] = np.clip(data[column], Lower_Whisker, Upper_Whisker)
    return data
# Treat outliers in a list of variables
def treat_outliers_all(data, col_list):
    """
    Treat outliers in a list of variables
    data: dataframe
    col_list: list of dataframe columns
    """
    for col in col_list:
        data = treat_outliers(data, col)
    return data
# The following code needs to be executed only if we want to treat all outliers.
# For now we proceed with the outliers in place.
# numerical_col = dfcopy.select_dtypes(include=np.number).columns.tolist()
# data = treat_outliers_all(dfcopy, numerical_col)
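For reference, the IQR capping that `treat_outliers` performs can be seen on a toy series (hypothetical values, chosen so the effect is obvious):

```python
import pandas as pd

# Toy series with one obvious outlier (100); values are invented.
s = pd.Series([1, 2, 3, 4, 100])

# Same whisker arithmetic as treat_outliers above.
Q1, Q3 = s.quantile(0.25), s.quantile(0.75)   # 2.0 and 4.0 here
IQR = Q3 - Q1                                 # 2.0
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR # -1.0 and 7.0

# Equivalent to the np.clip call in the function: 100 is capped at 7.
capped = s.clip(lower, upper)
print(capped.tolist())
```

The outlier is pulled in to the upper whisker while in-range values pass through unchanged.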
# Create dummy variables
# Use one-hot encoding for the binary (0/1) columns and for Family and Education
dummy_data = pd.get_dummies(dfcopy, columns=["Education", 'Securities_Account','CD_Account','Online' , 'CreditCard','Family'], drop_first=True)
dummy_data.head()
| Age | Income | CCAvg | Mortgage | Personal_Loan | Education_2 | Education_3 | Securities_Account_1 | CD_Account_1 | Online_1 | CreditCard_1 | Family_2 | Family_3 | Family_4 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 49 | 1.6 | 0 | 0 | False | False | True | False | False | False | False | False | True |
| 1 | 45 | 34 | 1.5 | 0 | 0 | False | False | True | False | False | False | False | True | False |
| 2 | 39 | 11 | 1.0 | 0 | 0 | False | False | False | False | False | False | False | False | False |
| 3 | 35 | 100 | 2.7 | 0 | 0 | True | False | False | False | False | False | False | False | False |
| 4 | 35 | 45 | 1.0 | 0 | 0 | True | False | False | False | False | True | False | False | True |
# We will split the dataset into dependent and independent variable sets
X = dummy_data.drop(["Personal_Loan"], axis=1)
Y = dummy_data["Personal_Loan"]
X.info() # returns summary of Independent features in a data set
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Age                   5000 non-null   int64
 1   Income                5000 non-null   int64
 2   CCAvg                 5000 non-null   float64
 3   Mortgage              5000 non-null   int64
 4   Education_2           5000 non-null   bool
 5   Education_3           5000 non-null   bool
 6   Securities_Account_1  5000 non-null   bool
 7   CD_Account_1          5000 non-null   bool
 8   Online_1              5000 non-null   bool
 9   CreditCard_1          5000 non-null   bool
 10  Family_2              5000 non-null   bool
 11  Family_3              5000 non-null   bool
 12  Family_4              5000 non-null   bool
dtypes: bool(9), float64(1), int64(3)
memory usage: 200.3 KB
Y.info() # returns summary of the dependent variable, Personal_Loan
<class 'pandas.core.series.Series'>
RangeIndex: 5000 entries, 0 to 4999
Series name: Personal_Loan
Non-Null Count  Dtype
--------------  -----
5000 non-null   int64
dtypes: int64(1)
memory usage: 39.2 KB
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
# Print summary for train and test data set
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True)*100)
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True)*100)
Shape of Training set :  (3500, 13)
Shape of test set :  (1500, 13)
Percentage of classes in training set:
Personal_Loan
0    90.542857
1     9.457143
Name: proportion, dtype: float64
Percentage of classes in test set:
Personal_Loan
0    90.066667
1     9.933333
Name: proportion, dtype: float64
We have split the dataset into training and testing sets.
In both sets, the target variable has roughly a 90:10 distribution between the values 0 and 1.
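The two splits above preserve the class ratio only approximately (90.5% vs 90.1%). One option, not used in this notebook, would be the `stratify` argument of `train_test_split`, which locks the ratio exactly. A sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: 90 negatives, 10 positives (toy data).
rng = np.random.default_rng(1)
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = rng.normal(size=(100, 3))  # dummy features

# stratify=y_toy forces both splits to carry the same 90:10 ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())  # both exactly 0.10 here
```

With a rare positive class, stratifying avoids a split that happens to starve one side of positives.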
Case Predictions:
Case 1 - Predicting a customer will buy a loan when in reality they do not: a loss of resources (False Positive, FP)
Case 2 - Predicting a customer will not buy a loan when in reality they do: a loss of opportunity (False Negative, FN)
Which case is more important?
The purpose of the loan campaign is to bring in more customers, so missing a potential buyer is the costlier mistake; Case 2 is more important.
How do we reduce this loss of opportunity, i.e., false negatives (FN)?
The company would want to maximize recall: the greater the recall score, the higher the chances of minimizing false negatives.
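Since recall = TP / (TP + FN), every missed buyer lowers it directly. A toy illustration with sklearn's `recall_score` (labels are invented):

```python
from sklearn.metrics import recall_score

# Toy labels: 4 actual buyers, of which the model misses one (a FN),
# plus one false alarm (a FP) among the non-buyers.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# recall = TP / (TP + FN) = 3 / (3 + 1)
print(recall_score(y_true, y_pred))
```

Note the FP does not affect recall at all; it would show up in precision instead, which is why both metrics are tracked below.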
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
# List all features in a list called feature_names
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'CCAvg', 'Mortgage', 'Education_2', 'Education_3', 'Securities_Account_1', 'CD_Account_1', 'Online_1', 'CreditCard_1', 'Family_2', 'Family_3', 'Family_4']
Build Default Decision Tree Model
Build tree using DecisionTreeClassifier function without class weights
# Build a default Decision Tree
model = DecisionTreeClassifier(criterion = 'gini', random_state=1)
model.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
# Build Confusion matrix for train and test model without weights
confusion_matrix_sklearn(model, X_train, y_train)
confusion_matrix_sklearn(model, X_test, y_test)
# Print different metrics for the default tree on training data
decision_tree_perf_train_without = model_performance_classification_sklearn(
model, X_train, y_train
)
decision_tree_perf_train_without
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Print different metrics for default tree for testing data
decision_tree_perf_test_without = model_performance_classification_sklearn(
model, X_test, y_test
)
decision_tree_perf_test_without
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.979333 | 0.899329 | 0.893333 | 0.896321 |
# Check the depth of the tree
model.get_depth()
10
# Shows the tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Report showing the rules of a decision tree
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50 | |--- CCAvg <= 2.95 | | |--- Income <= 106.50 | | | |--- weights: [2553.00, 0.00] class: 0 | | |--- Income > 106.50 | | | |--- Family_4 <= 0.50 | | | | |--- Education_3 <= 0.50 | | | | | |--- Age <= 28.50 | | | | | | |--- Education_2 <= 0.50 | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- Education_2 > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Age > 28.50 | | | | | | |--- weights: [43.00, 0.00] class: 0 | | | | |--- Education_3 > 0.50 | | | | | |--- Mortgage <= 231.00 | | | | | | |--- CCAvg <= 1.95 | | | | | | | |--- weights: [14.00, 0.00] class: 0 | | | | | | |--- CCAvg > 1.95 | | | | | | | |--- CCAvg <= 2.65 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- CCAvg > 2.65 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- Mortgage > 231.00 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- Family_4 > 0.50 | | | | |--- Age <= 32.50 | | | | | |--- CCAvg <= 2.40 | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.40 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 32.50 | | | | | |--- Age <= 60.00 | | | | | | |--- weights: [0.00, 6.00] class: 1 | | | | | |--- Age > 60.00 | | | | | | |--- weights: [4.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- Income <= 92.50 | | | |--- CD_Account_1 <= 0.50 | | | | |--- Age <= 26.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Age > 26.50 | | | | | |--- CCAvg <= 3.55 | | | | | | |--- CCAvg <= 3.35 | | | | | | | |--- Age <= 37.50 | | | | | | | | |--- Education_2 <= 0.50 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- Education_2 > 0.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Age > 37.50 | | | | | | | | |--- Income <= 82.50 | | | | | | | | | |--- weights: [23.00, 0.00] class: 0 | | | | | | | | |--- Income > 82.50 | | | | | | | | | |--- Income <= 83.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | 
| | | | | | | | |--- Income > 83.50 | | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.35 | | | | | | | |--- Family_4 <= 0.50 | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | |--- Family_4 > 0.50 | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.55 | | | | | | |--- Family_3 <= 0.50 | | | | | | | |--- Mortgage <= 93.50 | | | | | | | | |--- weights: [53.00, 0.00] class: 0 | | | | | | | |--- Mortgage > 93.50 | | | | | | | | |--- Mortgage <= 99.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Mortgage > 99.50 | | | | | | | | | |--- weights: [21.00, 0.00] class: 0 | | | | | | |--- Family_3 > 0.50 | | | | | | | |--- Online_1 <= 0.50 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | |--- Online_1 > 0.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- CD_Account_1 > 0.50 | | | | |--- weights: [0.00, 5.00] class: 1 | | |--- Income > 92.50 | | | |--- Education_3 <= 0.50 | | | | |--- Education_2 <= 0.50 | | | | | |--- CD_Account_1 <= 0.50 | | | | | | |--- Family_4 <= 0.50 | | | | | | | |--- Online_1 <= 0.50 | | | | | | | | |--- Income <= 102.00 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Income > 102.00 | | | | | | | | | |--- Family_3 <= 0.50 | | | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | | | |--- Family_3 > 0.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- Online_1 > 0.50 | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | |--- Family_4 > 0.50 | | | | | | | |--- CCAvg <= 4.20 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- CCAvg > 4.20 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- CD_Account_1 > 0.50 | | | | | | |--- Income <= 93.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Income > 93.50 | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- Education_2 > 0.50 | | | | 
| |--- Age <= 60.50 | | | | | | |--- weights: [0.00, 10.00] class: 1 | | | | | |--- Age > 60.50 | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | |--- Education_3 > 0.50 | | | | |--- Mortgage <= 172.00 | | | | | |--- CD_Account_1 <= 0.50 | | | | | | |--- weights: [0.00, 14.00] class: 1 | | | | | |--- CD_Account_1 > 0.50 | | | | | | |--- Family_3 <= 0.50 | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | |--- Family_3 > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- Mortgage > 172.00 | | | | | |--- Income <= 100.00 | | | | | | |--- Family_3 <= 0.50 | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | |--- Family_3 > 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- Income > 100.00 | | | | | | |--- weights: [0.00, 2.00] class: 1 |--- Income > 116.50 | |--- Education_3 <= 0.50 | | |--- Education_2 <= 0.50 | | | |--- Family_3 <= 0.50 | | | | |--- Family_4 <= 0.50 | | | | | |--- weights: [375.00, 0.00] class: 0 | | | | |--- Family_4 > 0.50 | | | | | |--- weights: [0.00, 14.00] class: 1 | | | |--- Family_3 > 0.50 | | | | |--- weights: [0.00, 33.00] class: 1 | | |--- Education_2 > 0.50 | | | |--- weights: [0.00, 108.00] class: 1 | |--- Education_3 > 0.50 | | |--- weights: [0.00, 114.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.309934
Education_2           0.240847
Education_3           0.165787
Family_3              0.103121
Family_4              0.062971
CCAvg                 0.050616
Age                   0.026589
CD_Account_1          0.026348
Mortgage              0.010725
Online_1              0.003063
Securities_Account_1  0.000000
CreditCard_1          0.000000
Family_2              0.000000
#Visualizes the important features
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(6, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The model perfectly classifies all data points in the training set.
There are no errors on the training set; every sample is classified correctly, giving 100% recall and 100% precision.
If no restrictions are applied, a decision tree will keep growing until it classifies every training point correctly, learning all the patterns (including noise) in the training set.
This generally leads to overfitting, as the decision tree performs best on the training set but generalizes worse.
The default decision tree scores 100% recall on the training set but only 89.93% on the test set.
As per the default decision tree model without class weights, Income is the most important feature, followed by Education and Family.
# Build a Decision Tree with class weights
weightmodel = DecisionTreeClassifier(criterion = 'gini', class_weight = {0:10,1:90}, random_state=1)
weightmodel.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 10, 1: 90}, random_state=1)
# Build Confusion matrix for train and test model with weights
confusion_matrix_sklearn(weightmodel, X_train, y_train)
confusion_matrix_sklearn(weightmodel, X_test, y_test)
# Print different metrics for the default tree with class weights on training data
decision_tree_perf_train = model_performance_classification_sklearn(
weightmodel, X_train, y_train
)
decision_tree_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
# Print different metrics for default tree with class weight for testing data
decision_tree_perf_test = model_performance_classification_sklearn(
weightmodel, X_test, y_test
)
decision_tree_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.978 | 0.85906 | 0.914286 | 0.885813 |
# Check the depth of the tree
weightmodel.get_depth()
15
# Shows the tree
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    weightmodel,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Report showing the rules of a decision tree -
print(tree.export_text(weightmodel, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50 | |--- CCAvg <= 2.95 | | |--- weights: [24350.00, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account_1 <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- Mortgage <= 102.50 | | | | | |--- CCAvg <= 3.05 | | | | | | |--- weights: [150.00, 0.00] class: 0 | | | | | |--- CCAvg > 3.05 | | | | | | |--- Family_4 <= 0.50 | | | | | | | |--- Income <= 67.00 | | | | | | | | |--- weights: [80.00, 0.00] class: 0 | | | | | | | |--- Income > 67.00 | | | | | | | | |--- Securities_Account_1 <= 0.50 | | | | | | | | | |--- Income <= 84.00 | | | | | | | | | | |--- Income <= 82.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- Income > 82.50 | | | | | | | | | | | |--- weights: [0.00, 450.00] class: 1 | | | | | | | | | |--- Income > 84.00 | | | | | | | | | | |--- Income <= 90.50 | | | | | | | | | | | |--- weights: [50.00, 0.00] class: 0 | | | | | | | | | | |--- Income > 90.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | |--- Securities_Account_1 > 0.50 | | | | | | | | | |--- weights: [40.00, 0.00] class: 0 | | | | | | |--- Family_4 > 0.50 | | | | | | | |--- weights: [130.00, 0.00] class: 0 | | | | |--- Mortgage > 102.50 | | | | | |--- weights: [210.00, 0.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- weights: [420.00, 0.00] class: 0 | | |--- CD_Account_1 > 0.50 | | | |--- weights: [0.00, 450.00] class: 1 |--- Income > 92.50 | |--- Education_3 <= 0.50 | | |--- Education_2 <= 0.50 | | | |--- Family_3 <= 0.50 | | | | |--- Family_4 <= 0.50 | | | | | |--- Income <= 103.50 | | | | | | |--- CCAvg <= 3.21 | | | | | | | |--- weights: [400.00, 0.00] class: 0 | | | | | | |--- CCAvg > 3.21 | | | | | | | |--- Income <= 97.00 | | | | | | | | |--- weights: [30.00, 0.00] class: 0 | | | | | | | |--- Income > 97.00 | | | | | | | | |--- Age <= 39.00 | | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | | | |--- Age > 39.00 | | | | | | | | | |--- weights: [0.00, 270.00] class: 1 | | | | | |--- Income > 103.50 | | 
| | | | |--- weights: [4330.00, 0.00] class: 0 | | | | |--- Family_4 > 0.50 | | | | | |--- Income <= 93.50 | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | |--- Income > 93.50 | | | | | | |--- Income <= 102.00 | | | | | | | |--- CreditCard_1 <= 0.50 | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | | |--- CreditCard_1 > 0.50 | | | | | | | | |--- weights: [0.00, 90.00] class: 1 | | | | | | |--- Income > 102.00 | | | | | | | |--- weights: [0.00, 1710.00] class: 1 | | | |--- Family_3 > 0.50 | | | | |--- Income <= 108.50 | | | | | |--- weights: [110.00, 0.00] class: 0 | | | | |--- Income > 108.50 | | | | | |--- Age <= 26.00 | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | |--- Age > 26.00 | | | | | | |--- Income <= 118.00 | | | | | | | |--- Online_1 <= 0.50 | | | | | | | | |--- weights: [0.00, 180.00] class: 1 | | | | | | | |--- Online_1 > 0.50 | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | |--- Income > 118.00 | | | | | | | |--- weights: [0.00, 2970.00] class: 1 | | |--- Education_2 > 0.50 | | | |--- Income <= 110.50 | | | | |--- CCAvg <= 2.90 | | | | | |--- Income <= 106.50 | | | | | | |--- weights: [400.00, 0.00] class: 0 | | | | | |--- Income > 106.50 | | | | | | |--- Age <= 52.00 | | | | | | | |--- weights: [50.00, 0.00] class: 0 | | | | | | |--- Age > 52.00 | | | | | | | |--- Family_3 <= 0.50 | | | | | | | | |--- weights: [0.00, 90.00] class: 1 | | | | | | | |--- Family_3 > 0.50 | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | |--- CCAvg > 2.90 | | | | | |--- Age <= 55.00 | | | | | | |--- weights: [0.00, 540.00] class: 1 | | | | | |--- Age > 55.00 | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | |--- Income > 110.50 | | | | |--- Income <= 116.50 | | | | | |--- Mortgage <= 141.50 | | | | | | |--- Age <= 60.50 | | | | | | | |--- CCAvg <= 1.20 | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | | |--- CCAvg > 1.20 | | | | | | | | |--- CCAvg <= 2.65 | | | | | | | | | |--- CCAvg <= 
1.75 | | | | | | | | | | |--- weights: [0.00, 180.00] class: 1 | | | | | | | | | |--- CCAvg > 1.75 | | | | | | | | | | |--- weights: [30.00, 0.00] class: 0 | | | | | | | | |--- CCAvg > 2.65 | | | | | | | | | |--- weights: [0.00, 450.00] class: 1 | | | | | | |--- Age > 60.50 | | | | | | | |--- weights: [30.00, 0.00] class: 0 | | | | | |--- Mortgage > 141.50 | | | | | | |--- weights: [40.00, 0.00] class: 0 | | | | |--- Income > 116.50 | | | | | |--- weights: [0.00, 9720.00] class: 1 | |--- Education_3 > 0.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.35 | | | | |--- Mortgage <= 236.00 | | | | | |--- weights: [400.00, 0.00] class: 0 | | | | |--- Mortgage > 236.00 | | | | | |--- Income <= 110.00 | | | | | | |--- weights: [40.00, 0.00] class: 0 | | | | | |--- Income > 110.00 | | | | | | |--- Age <= 34.50 | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | |--- Age > 34.50 | | | | | | | |--- weights: [0.00, 180.00] class: 1 | | | |--- CCAvg > 2.35 | | | | |--- Age <= 64.00 | | | | | |--- CCAvg <= 2.95 | | | | | | |--- CCAvg <= 2.55 | | | | | | | |--- CreditCard_1 <= 0.50 | | | | | | | | |--- weights: [0.00, 180.00] class: 1 | | | | | | | |--- CreditCard_1 > 0.50 | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | |--- CCAvg > 2.55 | | | | | | | |--- weights: [60.00, 0.00] class: 0 | | | | | |--- CCAvg > 2.95 | | | | | | |--- Mortgage <= 172.00 | | | | | | | |--- CD_Account_1 <= 0.50 | | | | | | | | |--- weights: [0.00, 1260.00] class: 1 | | | | | | | |--- CD_Account_1 > 0.50 | | | | | | | | |--- Age <= 51.50 | | | | | | | | | |--- weights: [20.00, 0.00] class: 0 | | | | | | | | |--- Age > 51.50 | | | | | | | | | |--- weights: [0.00, 90.00] class: 1 | | | | | | |--- Mortgage > 172.00 | | | | | | | |--- Age <= 39.00 | | | | | | | | |--- weights: [30.00, 0.00] class: 0 | | | | | | | |--- Age > 39.00 | | | | | | | | |--- Mortgage <= 199.00 | | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | | | |--- Mortgage > 199.00 | | | | | | | 
| | |--- Income <= 97.00 | | | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | | | | |--- Income > 97.00 | | | | | | | | | | |--- weights: [0.00, 270.00] class: 1 | | | | |--- Age > 64.00 | | | | | |--- weights: [30.00, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- weights: [0.00, 10260.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
weightmodel.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.633250
CCAvg                 0.094742
Family_4              0.080839
Education_2           0.066285
Family_3              0.063748
Education_3           0.023947
Mortgage              0.013352
Age                   0.011674
CD_Account_1          0.007908
Securities_Account_1  0.001879
CreditCard_1          0.001203
Online_1              0.001172
Family_2              0.000000
#Visualizes the important features
importances = weightmodel.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(6, 6))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="green", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
As per the default decision tree model with class weights, Income is the most important feature, followed by CCAvg and Family.
It is a very complex tree and appears to be over-fitting: recall is 100% on the train set but only 85.90% on the test set.
The tree is even more complex after adding weights, with a depth of 15.
Limiting the max depth to 6 makes the tree simpler.
The default decision tree was built with a depth of 10; pre-pruning to depth 6 keeps a little more than 50% of that.
# Build a Decision Tree with max depth of 6 (a little more than 50% of the default tree's depth)
limitmodel = DecisionTreeClassifier(criterion = 'gini',max_depth=6,random_state=1)
limitmodel.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=6, random_state=1)
# Build Confusion matrix for train and test model
confusion_matrix_sklearn(limitmodel, X_train, y_train)
confusion_matrix_sklearn(limitmodel, X_test, y_test)
# Print different metrics for the pre-pruned tree on training data
decision_tree_limit_perf_train = model_performance_classification_sklearn(limitmodel, X_train, y_train)
decision_tree_limit_perf_train
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.994857 | 0.94864 | 0.996825 | 0.972136 |
# Print different metrics for the pre-pruned tree on test data
decision_tree_limit_perf_test = model_performance_classification_sklearn(limitmodel, X_test, y_test)
decision_tree_limit_perf_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.981333 | 0.872483 | 0.935252 | 0.902778 |
# Check the depth of the tree
limitmodel.get_depth()
6
# Shows the tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    limitmodel,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Report showing the rules of a decision tree
print(tree.export_text(limitmodel, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family_4 <= 0.50
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [5.00, 1.00] class: 0
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |--- Mortgage <= 231.00
|   |   |   |   |   |   |--- weights: [15.00, 1.00] class: 0
|   |   |   |   |   |--- Mortgage >  231.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Family_4 >  0.50
|   |   |   |   |--- Age <= 32.50
|   |   |   |   |   |--- CCAvg <= 2.40
|   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.40
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  32.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- weights: [40.00, 7.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- weights: [77.00, 2.00] class: 0
|   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education_3 <= 0.50
|   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |   |   |--- weights: [33.00, 4.00] class: 0
|   |   |   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |   |   |--- weights: [1.00, 5.00] class: 1
|   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |--- weights: [0.00, 10.00] class: 1
|   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |--- Education_3 >  0.50
|   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |   |   |--- weights: [2.00, 1.00] class: 0
|   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |--- weights: [5.00, 1.00] class: 0
|   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|--- Income >  116.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- weights: [0.00, 108.00] class: 1
|   |--- Education_3 >  0.50
|   |   |--- weights: [0.00, 114.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
limitmodel.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.317812
Education_2           0.248480
Education_3           0.174878
Family_3              0.099498
Family_4              0.051526
CCAvg                 0.044703
CD_Account_1          0.027792
Age                   0.027472
Mortgage              0.007840
Securities_Account_1  0.000000
Online_1              0.000000
CreditCard_1          0.000000
Family_2              0.000000
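Because these Gini importances are normalized, they always sum to 1.0 across all features. A quick self-contained check of this property, on synthetic data (not the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data, used only to illustrate the normalization
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=1
)
demo_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_demo, y_demo)

# The (normalized) total criterion reductions sum to 1
print(demo_tree.feature_importances_.sum())  # 1.0 (up to float rounding)
```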
#Visualizes the important features
importances = limitmodel.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
For the decision tree with max_depth limited to 6, Income is the most important feature, followed by Education and Family size.
The tree is simpler, with a recall of 94.86% on the train set and 87.24% on the test set.
Using GridSearch for hyperparameter tuning of our tree model
Grid search is a tuning technique that exhaustively evaluates candidate hyperparameter combinations to find the best-performing one.
# Choose the type of classifier.
premodel = DecisionTreeClassifier(criterion = 'gini', random_state=1)
# Grid of parameters to choose from
parameters = {
"max_depth": np.arange(3, 10),
"min_samples_leaf": [1, 2, 5, 7, 10],
"max_leaf_nodes": [2, 3, 5, 10],
}
# Type of scoring used to compare parameter combinations
# (recall, since we want to minimize false negatives)
recall_scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(premodel, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
premodel = grid_obj.best_estimator_
# Fit the best algorithm to the data.
premodel.fit(X_train,y_train)
DecisionTreeClassifier(max_depth=5, max_leaf_nodes=10, min_samples_leaf=5,
                       random_state=1)
# Build Confusion matrix for train and test model without weights
confusion_matrix_sklearn(premodel, X_train, y_train)
confusion_matrix_sklearn(premodel, X_test, y_test)
# Print different metrics for the pre-pruned tree on training data
decision_tree_pre_perf_train = model_performance_classification_sklearn(premodel, X_train, y_train)
decision_tree_pre_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.986857 | 0.882175 | 0.976589 | 0.926984 |
# Print different metrics for the pre-pruned tree on test data
decision_tree_pre_perf_test = model_performance_classification_sklearn(premodel, X_test, y_test)
decision_tree_pre_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.974667 | 0.791946 | 0.944000 | 0.861314 |
# Check the depth of the tree
premodel.get_depth()
5
# Shows the tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
premodel,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Report showing the rules of a decision tree
print(tree.export_text(premodel, feature_names=feature_names, show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2632.00, 10.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |--- weights: [117.00, 10.00] class: 0
|   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education_3 <= 0.50
|   |   |   |   |--- weights: [38.00, 19.00] class: 0
|   |   |   |--- Education_3 >  0.50
|   |   |   |   |--- weights: [7.00, 18.00] class: 1
|--- Income >  116.50
|   |--- Education_3 <= 0.50
|   |   |--- Education_2 <= 0.50
|   |   |   |--- Family_3 <= 0.50
|   |   |   |   |--- Family_4 <= 0.50
|   |   |   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |   |   |--- Family_4 >  0.50
|   |   |   |   |   |--- weights: [0.00, 14.00] class: 1
|   |   |   |--- Family_3 >  0.50
|   |   |   |   |--- weights: [0.00, 33.00] class: 1
|   |   |--- Education_2 >  0.50
|   |   |   |--- weights: [0.00, 108.00] class: 1
|   |--- Education_3 >  0.50
|   |   |--- weights: [0.00, 114.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
premodel.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                           Imp
Income                0.335478
Education_2           0.258373
Education_3           0.188599
Family_3              0.107563
Family_4              0.051352
CCAvg                 0.043100
CD_Account_1          0.015535
Age                   0.000000
Mortgage              0.000000
Securities_Account_1  0.000000
Online_1              0.000000
CreditCard_1          0.000000
Family_2              0.000000
#Visualizes the important features
importances = premodel.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
With pre-pruning via the grid-searched parameters, Income is again the most important feature, followed by Education and Family size.
It is a much simpler decision tree, with a max_depth of only 5.
Recall is 88.21% on the train set and 79.19% on the test set.
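Besides `best_estimator_`, a fitted `GridSearchCV` exposes `best_params_` (the winning combination) and `best_score_` (its mean cross-validated score), which are useful for reporting what the search actually selected. A minimal self-contained sketch on synthetic data (not the bank dataset), with a smaller hypothetical grid:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data and grid, for illustration only
X_demo, y_demo = make_classification(n_samples=300, random_state=1)

demo_grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5]},
    scoring=make_scorer(recall_score),  # same recall-based scorer as above
    cv=5,
).fit(X_demo, y_demo)

print(demo_grid.best_params_)  # winning hyperparameter combination
print(demo_grid.best_score_)   # its mean cross-validated recall
```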
# Creates a tree and return ccp_alpha with impurities
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path).T
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ccp_alphas | 0.0 | 0.000268 | 0.000268 | 0.000275 | 0.000278 | 0.000359 | 0.000381 | 0.000381 | 0.000476 | 0.000476 | ... | 0.000882 | 0.001552 | 0.001552 | 0.002333 | 0.003294 | 0.006473 | 0.007712 | 0.016154 | 0.032821 | 0.047088 |
| impurities | 0.0 | 0.000536 | 0.001609 | 0.002710 | 0.003824 | 0.004900 | 0.005280 | 0.005661 | 0.006138 | 0.006614 | ... | 0.016352 | 0.017903 | 0.022560 | 0.024893 | 0.028187 | 0.034659 | 0.042372 | 0.058525 | 0.124167 | 0.171255 |
2 rows × 26 columns
# Plot the ccp_alpha against impurities
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Train the decision tree using the effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
clf.fit(X_train, y_train)
clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04708834100596768
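This behaviour, where a large enough `ccp_alpha` collapses the tree all the way down to its root, can be reproduced in isolation on synthetic data (not the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data; any ccp_alpha above the largest effective
# alpha of the fitted tree prunes every split away
X_demo, y_demo = make_classification(n_samples=200, random_state=1)
stump = DecisionTreeClassifier(ccp_alpha=1.0, random_state=1).fit(X_demo, y_demo)

print(stump.tree_.node_count)  # 1: only the root node remains
```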
For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(16,12))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
# Recall vs alpha on train set
recall_train = []
for clf in clfs:
pred_train = clf.predict(X_train)
values_train = recall_score(y_train,pred_train)
recall_train.append(values_train)
# Recall vs alpha on test set
recall_test = []
for clf in clfs:
pred_test = clf.predict(X_test)
values_test = recall_score(y_test,pred_test)
recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
drawstyle="steps-post")
ax.legend()
plt.show()
# Select the model with the highest recall on the test set
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(random_state=1)
# Build a Decision Tree using ccp_alpha above with class weights
postmodel = DecisionTreeClassifier(
ccp_alpha=0.04708834100596768, class_weight={0: 0.10, 1: 0.90}, random_state=1
)
postmodel.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.04708834100596768,
                       class_weight={0: 0.1, 1: 0.9}, random_state=1)
# Build Confusion matrix for train and test model
confusion_matrix_sklearn(postmodel, X_train, y_train)
confusion_matrix_sklearn(postmodel, X_test, y_test)
# Print different metrics for the post-pruned decision tree on training data
decision_tree_post_perf_train = model_performance_classification_sklearn(
postmodel, X_train, y_train
)
decision_tree_post_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.819429 | 0.954683 | 0.338692 | 0.500000 |
# Print different metrics for the post-pruned decision tree on test data
decision_tree_post_perf_test = model_performance_classification_sklearn(
postmodel, X_test, y_test
)
decision_tree_post_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.800000 | 0.932886 | 0.324009 | 0.480969 |
# Plot the tree
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
postmodel,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
arrow = o.arrow_patch
if arrow is not None:
arrow.set_edgecolor("black")
arrow.set_linewidth(1)
plt.show()
# Report showing the rules of a decision tree
print(tree.export_text(postmodel, feature_names=feature_names, show_weights=True))
|--- Income <= 92.50
|   |--- weights: [255.20, 13.50] class: 0
|--- Income >  92.50
|   |--- weights: [61.70, 284.40] class: 1
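The post-pruned tree thus reduces to a single threshold rule on Income. As an illustrative, hypothetical predicate mirroring the printed rule (not code from the notebook — the fitted `postmodel` remains the actual classifier):

```python
def predicts_loan(income_thousands: float) -> int:
    """Mirror of the post-pruned tree's single split: class 1 (likely to
    accept a personal loan) when annual income exceeds $92.5k."""
    return 1 if income_thousands > 92.5 else 0

print(predicts_loan(120))  # 1
print(predicts_loan(60))   # 0
```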
# Visualize the feature importances (only Income should be non-zero now)
importances = postmodel.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
With post-pruning, Income is the only feature the decision tree uses.
It is a very simple decision tree with a depth of just 1.
Recall is 95.46% on the train set and 93.28% on the test set.
The model gives good, generalized results on both the training and test sets.
# training performance comparison
models_train_comp_df = pd.concat(
[
decision_tree_perf_train_without.T,
decision_tree_perf_train.T,
decision_tree_limit_perf_train.T,
decision_tree_pre_perf_train.T,
decision_tree_post_perf_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning with Limit)",
"Decision Tree (Pre-Pruning with GridSearch)",
"Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning with Limit) | Decision Tree (Pre-Pruning with GridSearch) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.994857 | 0.986857 | 0.819429 |
| Recall | 1.0 | 1.0 | 0.948640 | 0.882175 | 0.954683 |
| Precision | 1.0 | 1.0 | 0.996825 | 0.976589 | 0.338692 |
| F1 | 1.0 | 1.0 | 0.972136 | 0.926984 | 0.500000 |
# testing performance comparison
models_test_comp_df = pd.concat(
[
decision_tree_perf_test_without.T,
decision_tree_perf_test.T,
decision_tree_limit_perf_test.T,
decision_tree_pre_perf_test.T,
decision_tree_post_perf_test.T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree without class_weight",
"Decision Tree with class_weight",
"Decision Tree (Pre-Pruning with Limit)",
"Decision Tree (Pre-Pruning with GridSearch)",
"Decision Tree (Post-Pruning)",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
| | Decision Tree without class_weight | Decision Tree with class_weight | Decision Tree (Pre-Pruning with Limit) | Decision Tree (Pre-Pruning with GridSearch) | Decision Tree (Post-Pruning) |
|---|---|---|---|---|---|
| Accuracy | 0.979333 | 0.978000 | 0.981333 | 0.974667 | 0.800000 |
| Recall | 0.899329 | 0.859060 | 0.872483 | 0.791946 | 0.932886 |
| Precision | 0.893333 | 0.914286 | 0.935252 | 0.944000 | 0.324009 |
| F1 | 0.896321 | 0.885813 | 0.902778 | 0.861314 | 0.480969 |
We analyzed the Personal Loan campaign using a decision tree classifier to build a predictive model.
The model can be used to predict whether a customer will take a personal loan.
We visualized the different trees and their confusion matrices to better understand each model.
We did not remove outliers and still obtained a simple post-pruned decision tree.
Income, followed by Education and Family size, are the most important factors in predicting whether a customer will take a loan.
The decision tree with post-pruning gave the highest recall: 95.46% on the train set and 93.28% on the test set.
The pre-pruned and post-pruned models reduced overfitting and give generalized performance.
The post-pruned tree is much simpler and much easier to interpret.
What recommendations would you suggest to the bank?
The sales team's goal should be to minimize lost opportunity, i.e., the chance of predicting that a customer will not purchase a loan when in reality they would. This is achieved by minimizing false negatives, or equivalently, maximizing recall.
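Concretely, recall = TP / (TP + FN), so every false negative (a buyer the model misses) lowers it. A tiny illustration with made-up labels:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: 1 = customer bought the loan, 0 = did not
y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 4 actual buyers
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one buyer missed (a false negative)

# recall = TP / (TP + FN) = 3 / (3 + 1)
print(recall_score(y_true, y_pred))  # 0.75
```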
We built decision tree models four ways: with default settings, with class weights, with pre-pruning, and with post-pruning.
If accuracy is the priority, the pre-pruned decision tree is the way to go; if recall is the priority, the post-pruned decision tree should be used.
The marketing team should focus on the most important feature, Income, when predicting whether a customer will take a loan. After Income, customers with a Graduate or Advanced/Professional degree and customers with larger families are the strongest indicators of a customer's probability of taking a personal loan.
Customers with high income tend to have high credit-card spending, whereas customers with low income have low spending.
The sales/marketing team should establish dedicated relationships with high-value customers.